Chuyển sang Thị giác Máy tính: Vì sao lại là CNN?

Chuyển sang Thị giác Máy tính

Hôm nay chúng ta chuyển từ xử lý dữ liệu đơn giản, có cấu trúc bằng các lớp tuyến tính cơ bản sang giải quyết dữ liệu hình ảnh ở chiều cao. Một hình ảnh màu đơn lẻ đã tạo ra sự phức tạp đáng kể mà các kiến trúc tiêu chuẩn không thể xử lý hiệu quả. Học sâu cho thị giác cần một phương pháp chuyên biệt: mạng nơ-ron tích chập (CNN). Mạng nơ-ron tích chập (CNN).

1. Tại sao Mạng nơ-ron đầy đủ kết nối (FCN) lại thất bại?

Trong FCN, mỗi pixel đầu vào phải được kết nối với mọi nơ-ron ở tầng tiếp theo. Với hình ảnh độ phân giải cao, điều này dẫn đến bùng nổ về tính toán, khiến việc huấn luyện trở nên bất khả thi và khả năng tổng quát kém do hiện tượng quá khớp nghiêm trọng.

Input Dimension: A standard $224 \times 224$ RGB image results in $150,528$ input features ($224 \times 224 \times 3$).
Hidden Layer Size: If the first hidden layer uses 1,024 neurons.
Total Parameters (Layer 1): $\approx 154$ million weights ($150,528 \times 1024$) just for the first connection block, requiring massive memory and compute time.

Giải pháp từ CNN

CNN giải quyết vấn đề mở rộng của FCN bằng cách khai thác cấu trúc không gian của hình ảnh. Chúng nhận diện các mẫu (như đường viền hoặc đường cong) thông qua các bộ lọc nhỏ, giảm số lượng tham số hàng chục lần và tăng độ bền.

TERMINALbash — model-env

> Ready. Click "Run" to execute.

PARAMETER EFFICIENCY INSPECTOR Live

Run comparison to visualize parameter counts.

Question 1

What is the primary benefit of using Local Receptive Fields in CNNs?

Filters only focus on a small, localized region of the input image.

It allows the network to process the entire image globally at once.

It ensures all parameters are initialized to zero.

It removes the need for activation functions.

Question 2

If a $3 \times 3$ filter is applied across an entire image, what core CNN concept is being utilized?

Kernel Normalization

Shared Weights

Full Connectivity

Feature Transposition

Question 3

Which CNN component is responsible for progressively reducing the spatial dimensions (width and height) of the feature maps?

ReLU Activation

Pooling Layers (Subsampling)

Batch Normalization

Challenge: Identifying Key CNN Components

Relate CNN mechanisms to their functional benefits.

We need to build a vision model that is highly parameter efficient and can recognize an object even if it slightly shifts its position in the image.

Step 1

Which mechanism ensures the network can identify a feature (like a diagonal line) regardless of where it is in the frame?

Solution:
Shared Weights. By using the same filter across all locations, the network learns translation invariance.

Step 2

What architectural choice allows a CNN to detect features with fewer parameters than an FCN?

Solution:
Local Receptive Fields (or Sparse Connectivity). Instead of connecting to every pixel, each neuron only connects to a small, localized region of the input.

Step 3

How does the CNN structure lead to hierarchical feature learning (e.g., edges $\to$ corners $\to$ objects)?

Solution:
Stacked Layers. Early layers learn simple features (edges) using convolution. Deeper layers combine the outputs of earlier layers to form complex, abstract features (objects).